(54) System and method for thematically analyzing and annotating an audio-visual sequence



(57) This disclosure describes a method and system for creating an annotated analysis of the thematic content of a film or video work. The annotations may refer to single frames, or to sequences of consecutive frames. The sequences of frames for a given theme may overlap with one or more single frames or sequences of frames from one or more other themes in the work.
Description

BACKGROUND OF THE INVENTION

[0001] The present invention relates to the processing of movie or video material, more specifically to the manual, semi-automatic, or automatic annotation of thematically-based events and sequences within the material.
[0002] As initially conceived, movies and television programs were intended to be viewed as linear, sequential time experiences; that is, they ran from beginning to end, in accordance with the intent of the creator of the piece and at the pacing determined during the editing of the work. However, under some circumstances a viewer may wish to avoid a linear viewing experience. For example, the viewer may wish only a synopsis of the work, or may wish to browse, index, search, or catalog all or a portion of a work.

[0003] With the advent of recording devices and personal entertainment systems, control over pacing and presentation order fell more and more to the viewer. The video cassette recorder (VCR) provided primitive functionality including pause, rewind, fast forward and fast reverse, thus enabling simple control over the flow of time in the experience of the work. However, the level of control was necessarily crude and limited. With the advent of laser discs, the level of control moved to frame-accurate cueing, thus increasing the flexibility of the viewing experience. However, no simple indexing scheme was available to permit the viewer to locate and view only specific segments of the video on demand.
[0004] Modern computer technology has enabled storage of and random access to digitized film and video sources. The DVD has brought compressed digitized movies into the hands of the viewer, and has provided a simple level of access, namely chapter-based browsing and viewing.

[0005] Standard movie and film editing technology is based on the notion of a 'shot', which is defined as a single series of images which constitutes an entity within the story line of the work. Shots are by definition non-overlapping, contiguous elements. A 'scene' is made up of one or more shots, and a complete movie or video work comprises a plurality of scenes.
[0006] Video analysis for database indexing, archiving and retrieval has also advanced in recent years. Algorithms and systems have been developed for automatic scene analysis, including feature recognition; motion detection; fade, cut, and dissolve detection; and voice recognition. However, these analysis tools are based upon the notion of a shot or sequence, one of a succession of non-overlapping series of images that form the second level constituents of a work, just above the single frame. For display and analysis purposes, a work is often depicted as a tree structure, wherein the work is subdivided into discrete sequences, each of which may be further subdivided. Each sequence at the leaf positions of such a tree is disjoint from all other leaf nodes. When working interactively with such a structure, each node may be represented by a representative frame from the sequence, and algorithms exist for automatically extracting key frames from a sequence.
[0007] Whereas this method of analyzing, annotating and depicting a film or video work is useful, it exhibits a fundamental limitation inherent in the definition of a 'shot'. Suppose for a moment that a shot consisted of a single frame. If more than one object appears in that frame, then the frame can be thought of as having at least two thematic elements, but the content of the shot is limited to a singular descriptor. This limitation may be avoided by creating a multiplicity of shots, each of which contains a unique combination of objects or thematic elements, then giving each a unique descriptor. However, such an approach becomes completely intractable for all but the most degenerate plot structures.
[0008] The intricate interplay between content and themes has long been recognized in written literature, and automated and semi-automated algorithms and systems have appeared to perform thematic analysis and classification of audible or machine-readable text. A single chapter, paragraph or sentence may advance or contribute multiple themes, so often no clear distinction or relationship can be inferred or defined between specific subdivisions of the text and overlying themes or motifs of the work. Themes supersede the syntactic subdivisions of the text, and must be described and annotated as often-concurrent parallel elements that are elucidated throughout the text.

[0009] Some elements of prior art have attempted to perform this type of analysis on video sequences. Abecassis, in a series of patents, perfected the notion of 'categories' as a method of analysis, and described the use of "video content preferences" which refer to "preestablished and clearly defined preferences as to the manner or form (e.g. explicitness) in which a story/game is presented, and the absence of undesirable matter (e.g. profanity) in the story/game" (U.S. Patent 5,434,678; see also U.S. 5,589,945, U.S. 5,664,046, U.S. 5,684,918, U.S. 5,696,869, U.S. 5,724,472, U.S. 5,987,211, U.S. 6,011,895, U.S. 6,067,401, and U.S. 6,072,934). Abecassis further extends the notion of "video content preferences" to include "types of programs/games (e.g. interactive video detective games), or broad subject matter (e.g. mysteries)." Inherent in Abecassis' art is the notion that the content categories can be defined exclusive of the thematic content of the film or video, and that a viewer can predefine a series of choices along these predefined categories with which to filter the content of the work. Abecassis does not take into account the plot or thematic elements that make up the work, but rather focuses on the manner or form in which these elements are presented.

[0010] In a more comprehensive approach to the subject, Benson et al. (U.S. Patent 5,574,845) describe a system for describing and viewing video data based upon models of the video sequence, including time, space, object and event, the event model being most similar to the subject of the current disclosure. In '845, the event model is defined as a sequence of possibly-overlapping episodes, each of which is characterized by elements from time and space models which also describe the video, and objects from the object model of the video. However, this description of the video is a strictly structural one, in that the models of the video developed in '845 do not take into account the syntactic, semantic, or semiotic content or significance of the 'events' depicted in the video. In a similar way, Benson et al. permit overlapping events, but this overlap is strictly of the form "Event A contains one or more of Event B", whereas thematic segmentation can and will produce overlapping segments in all general relationships.
[0011] The automatic assignment of thematic signifi- 
cance to video segments is beyond the capability of cur- 
rent computer systems. Methods exist in the art for de- 
tecting scene cuts, fades and dissolves; for detecting 
and analyzing camera and object motion in video se- 
quences; for detecting and tracking objects in a series 
of images; for detecting and reading text within images; 
and for making sophisticated analyses and transforma- 
tions of video images. However, the assignment of con- 
textual meaning to any of this data must presently be 
done, or at least be augmented, by the intervention of 
an expert who groups simpler elements of analysis like 
key frames and shots, and assigns meaning and signif- 
icance to them in terms of the themes or concepts which 
the work exposits.

[0012] What is required is a method of thematically analyzing and annotating the linear time sequence of a film or video work, where thematic elements can exist in parallel with one another, and where the occurrence of one thematic element can overlap the occurrence of another thematic element.

SUMMARY OF THE INVENTION 

[0013] This disclosure describes a method and sys- 
tem for creating an annotated analysis of the thematic 
content of a film or video work. The annotations may 
refer to single frames, or to sequences of consecutive 
frames. The sequences of frames for a given theme may 
overlap with one or more single frames or sequences of
frames from one or more other themes in the work. 

BRIEF DESCRIPTION OF THE DRAWINGS 

[0014] 

FIG. 1 illustrates a video sequence timeline with annotations appended according to a preferred embodiment of the invention.
FIG. 2 is a schematic view of the video sequence timeline of FIG. 1 with the sequence expressed as a linear sequence of frames.
FIG. 3 is a schematic view of one frame of the video sequence of FIG. 2.
FIG. 4 is a schematic, magnified view of a portion of the frame of FIG. 3.
FIG. 5 is a flow diagram illustrating the preferred method for retrieving and displaying a desired video sequence from compressed video data.
FIG. 6 is a schematic diagram of nested menus from a graphic user interface according to the invention to enable selection of appropriate video segments from the entire video sequence by the user of the system.

DETAILED DESCRIPTION 

[0015] The high level description of the current invention refers to the timeline description of a video sequence 10, which is shown schematically in FIG. 1. Any series of video images may be labeled with annotations that designate scenes 12a-12e, scene boundaries 14a-14d (shown by the dotted lines), key frames, presence of objects or persons, and other similar structural, logical, functional, or thematic descriptions. Here, objective elements such as the appearance of two characters (Jimmy and Jane) within the video frame and their participation within a dance number are shown as blocks which are associated with certain portions of the video sequence 10.

[0016] The dashed lines linking the blocks serve to highlight the association between pairs of events, which might be assigned thematic significance. In this short example, Jimmy enters the field of view at the beginning of a scene in block 16. Later in the same scene, Jane enters in block 18. A scene change 14b occurs, but Jimmy and Jane are still in view. They begin to dance together starting from block 20, and dance for a short period until block 22. After a brief interval, the scene changes again at 14c, and shortly thereafter Jimmy leaves the camera's view in block 24. Some time later the scene changes again at 14d, and Jane has now left the camera's view in block 26.

[0017] FIG. 1 demonstrates the potentially overlapping nature of thematic elements, their disjuncture from simple scene boundaries 14a-14d, and the necessary overlay of meaning and significance on the mere 'events' that is required for thematic analysis. The expert who performs the analysis will address questions such as, "How is the dance number in this portion of the work related to other actions, objects, and persons in other portions of the work?" From a series of such questions, annotations are created which lend contextual and analytical meaning to individual frames and series of frames within the video.
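
By way of illustration only, the timeline of FIG. 1 might be modeled as a set of labeled frame intervals. The sketch below is an assumption-laden example, not the disclosed implementation: the frame numbers are invented (FIG. 1 gives only the order of events), and the names `Span`, `overlaps`, `active_at` and `crosses_cut` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    label: str
    start: int  # first frame, inclusive (hypothetical numbering)
    end: int    # last frame, inclusive


# Hypothetical frame numbers standing in for scene boundaries 14a-14d.
# A cut at frame c means a new scene starts at frame c.
SCENE_CUTS = [100, 200, 300, 400]

# Hypothetical spans following the order of events described for FIG. 1.
SPANS = [
    Span("Jimmy in view", 100, 320),
    Span("Jane in view", 150, 390),
    Span("Jimmy and Jane dance", 220, 270),
]


def overlaps(a, b):
    """True if the two spans share at least one frame."""
    return a.start <= b.end and b.start <= a.end


def active_at(frame, spans):
    """Labels of every span that contains the given frame."""
    return [s.label for s in spans if s.start <= frame <= s.end]


def crosses_cut(span, cuts):
    """True if a scene boundary falls strictly inside the span."""
    return any(span.start < c <= span.end for c in cuts)
```

With these invented numbers, frame 250 belongs to all three spans at once, and the "Jimmy in view" span runs across two scene boundaries, which is the kind of overlap a shot-based tree cannot express.
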

[0018] The process of generating annotations for a film or video work proceeds as follows. If the work is compressed, as for example using MPEG-2 compression, it is decompressed. An example of a compressed portion of a video sequence is shown in FIG. 2. The sequence shown comprises a series of frames that are intended to be shown sequentially on a timeline. Standard video is shot at thirty frames per second and, at least in the case of compressed video such as MPEG-2, includes approximately two base frames ("I-frames") per second of video shot to form two sets of fifteen-frame Group-of-Picture (GOP) segments. The MPEG-2 standard operates to compress video data by storing changes in subsequent frames from previous frames. Thus, one would normally be unable to completely and accurately decompress a random frame using the MPEG-2 standard without knowing the context of surrounding frames. Base frames, such as base frames B1 and C1, are complete in and of themselves and thus can be decompressed without referring to previous frames. Each base frame is associated with subsequent regular frames; for instance, frame B1 is related to frames B2-B15 to present a complete half-second of video.
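
The relationship between a frame and its base frame can be sketched arithmetically. The following is a simplified illustration that assumes fixed fifteen-frame GOPs, zero-based frame indices, and frames that depend only on earlier frames of the same GOP; real MPEG-2 streams may vary GOP length and use bidirectional prediction. The function names and the convention that the first GOP is lettered "A" are assumptions made for the example.

```python
GOP_SIZE = 15  # frames per Group of Pictures, as in the half-second example


def base_frame_index(frame_index, gop_size=GOP_SIZE):
    """Zero-based index of the I-frame that starts the GOP holding frame_index."""
    return (frame_index // gop_size) * gop_size


def frames_to_decode(frame_index, gop_size=GOP_SIZE):
    """Frames that must be decoded, in order, to reconstruct frame_index."""
    return list(range(base_frame_index(frame_index, gop_size), frame_index + 1))


def frame_name(frame_index, gop_size=GOP_SIZE):
    """Name a frame in the style of FIG. 2 (B1, B2, ..., C1, ...).

    Assumes the first GOP is lettered 'A'; only meaningful for the
    first 26 GOPs.
    """
    gop, position = divmod(frame_index, gop_size)
    return f"{chr(ord('A') + gop)}{position + 1}"
```

Under these assumptions frame index 21 is named B7, its base frame is index 15 (B1), and seven frames (B1 through B7) must be decoded before B7 can be shown.
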
[0019] Once the work is decompressed, the expert viewer of the list or user of the interactive tool can view, create, edit, annotate, or delete the attributes assigned to certain frames of the video. In addition, higher-level attributes can be added to the annotation list. Each such thematic attribute receives a text label, which describes the content of the attribute. As thematic attributes are created and labeled, they are assigned to classes or sets, each of which represents one on-going analytical feature of the work. For example, each appearance of a particular actor may be labeled and assigned to the plotline involving the actor. Additionally, a subset of those appearances may be grouped together into a different thematic set, as representative of the development of a particular idea or motif in the work. Appearances of multiple actors may be grouped, and combined with objects seen within the work. The combinations of attributes which can be created are limited only by the skill, imagination and understanding of the expert performing the annotation.

[0020] Automatic or semi-automatic analysis tools might be used to determine first level attributes of the film, such as scene boundaries 14; the presence of actors, either generally or by specific identity; the presence of specific objects; the occurrence of decipherable text in the video images; zoom or pan camera movements; motion analysis; or other algorithmically-derivable attributes of the video images. These attributes are then presented for visual inspection, either by means of a list of the attributes, or preferentially by means of an interactive computer tool that shows various types and levels of attributes, possibly along with a timeline of the video and with key frames associated with the corresponding attribute annotations.

[0021] The annotations form a metadata description of the content of the work. As with other metadata like the Dublin Core (http://purl.org/dc), these metadata can be stored separate from the work itself, and utilized in isolation from or in combination with the work. The metadata annotation of the work might be utilized by an interactive viewing system that can present the viewer with alternative choices of viewing the work.
[0022] The annotation metadata takes two forms. The low-level annotation consists of a type indicator, start time, duration or stop time, and a pointer to a label string. The type indicator may refer to a person, event, object, text, or other similar structural element. The start and stop times may be given in absolute terms using the timing labels of the original work, or in relative values from the beginning of the work, or any other convenient reference point. Labeling is done by indirection to facilitate the production of alternative-language versions of the metadata.
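
One possible in-memory shape for such a low-level annotation is sketched below. It is offered as an example only: the class and field names, the use of seconds for times, and the per-language label tables (including their contents) are assumptions rather than part of the disclosure. The point illustrated is the indirection: the annotation stores a label identifier, and the text is looked up in a separate table that can be swapped per language.

```python
from dataclasses import dataclass
from enum import Enum


class AnnotationType(Enum):
    PERSON = 1
    EVENT = 2
    OBJECT = 3
    TEXT = 4


@dataclass(frozen=True)
class LowLevelAnnotation:
    type: AnnotationType
    start: float     # seconds from the beginning of the work (relative form)
    duration: float  # seconds
    label_id: int    # key into a label table, not the label text itself

    @property
    def stop(self):
        """Stop time derived from start and duration."""
        return self.start + self.duration


# Hypothetical label tables; only the table changes between languages.
LABELS = {
    "en": {0: "Jimmy enters", 1: "Jane enters"},
    "fr": {0: "Jimmy entre", 1: "Jane entre"},
}


def label_text(annotation, language):
    """Resolve the annotation's label through the table for one language."""
    return LABELS[language][annotation.label_id]
```

Because the annotation never holds the text, producing a new language version means adding one table; the annotations themselves are untouched.
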

[0023] In the preferred implementation, the work is compressed using the MPEG-2 video compression standard after the annotation work is completed, and care is taken to align Group-of-Picture (GOP) segments with significant key frames in the annotation, to facilitate the search and display process. Preferentially, each key frame is encoded as an MPEG I-frame, which may be at the beginning of a GOP (as in frames B1 and C1 in FIG. 2), so that the key frame can be searched to and displayed efficiently when the metadata is being used for viewing or scanning the work. In this case, the compression processing necessitates an additional step required to connect frame time with file position within the video sequence data stream. The nature of the MPEG-2 compression standard is such that elapsed time in a work is not linearly related to file position within the resulting data stream. Thus, an index must be created to convert between frame time, which is typically given in SMPTE time code format 'hh:mm:ss:ff' 34 (FIG. 4), and stream position, which is a byte/bit offset into the raw data stream. This index may be utilized by converting the annotation start time values to stream offsets, or by maintaining a separate temporal index that relates SMPTE start time to offset.
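
A separate temporal index of the kind described might be consulted as in the sketch below. This is an illustration under stated assumptions: a flat thirty frames per second with no drop-frame correction, an index with one entry per I-frame, and byte offsets that are invented for the example. The names `smpte_to_frame`, `TEMPORAL_INDEX` and `seek_point` are hypothetical.

```python
import bisect

FPS = 30  # assumes non-drop-frame time code at a flat 30 frames per second


def smpte_to_frame(timecode, fps=FPS):
    """Convert 'hh:mm:ss:ff' to a zero-based frame count."""
    hh, mm, ss, ff = (int(part) for part in timecode.split(":"))
    if ff >= fps:
        raise ValueError(f"frame field {ff} is out of range for {fps} fps")
    return ((hh * 60 + mm) * 60 + ss) * fps + ff


# Hypothetical index: (frame number of an I-frame, byte offset in the stream),
# sorted by frame number. The offsets are made up and deliberately uneven,
# since elapsed time is not linearly related to stream position.
TEMPORAL_INDEX = [(0, 0), (15, 48_210), (30, 91_877), (45, 140_302)]
_INDEX_FRAMES = [frame for frame, _ in TEMPORAL_INDEX]


def seek_point(timecode):
    """Return (I-frame number, byte offset, frames to skip before display)."""
    frame = smpte_to_frame(timecode)
    i = bisect.bisect_right(_INDEX_FRAMES, frame) - 1
    if i < 0:
        raise LookupError("time code precedes the first indexed I-frame")
    base_frame, offset = TEMPORAL_INDEX[i]
    return base_frame, offset, frame - base_frame
```

For example, `seek_point("00:00:00:21")` selects the I-frame at frame 15 and reports that six decoded frames precede the requested one. The alternative mentioned in the text, rewriting annotation start times as stream offsets, would trade this lookup for metadata that is tied to one particular encoding.
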

[0024] The second-level thematic annotations utilize the first-level structural annotations. Each thematic annotation consists of a type indicator, a pointer to a label, and a pointer to the first of a linked list of elements, each of which is a reference to either a first-level annotation, or another thematic annotation. The type indicators can either be generic, such as action sequence, dance number, or song; or be specific to the particular work, such as actor- or actress-specific, or a particular plot thread. All thematic indicators within a given work are unique. The element references may be by element type and start time, or by direct positional reference within the metadata file itself.
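
The nesting of thematic annotations can be illustrated with a small recursive structure. The sketch below substitutes a Python list for the linked list of elements and uses invented frame ranges and identifiers; `Structural`, `Thematic`, `flatten` and `merge` are hypothetical names, and no guard against circular references is included.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Structural:
    kind: str   # e.g. "person", "event", "object", "text"
    start: int  # first frame, inclusive
    end: int    # last frame, inclusive


@dataclass
class Thematic:
    type_id: int   # type indicator, unique within the work
    label_id: int  # indirect reference to a label string
    elements: List[Union[Structural, "Thematic"]] = field(default_factory=list)


def flatten(theme) -> List[Tuple[int, int]]:
    """Collect the frame spans of every structural element under a theme."""
    spans = []
    for element in theme.elements:
        if isinstance(element, Thematic):
            spans.extend(flatten(element))  # descend into nested themes
        else:
            spans.append((element.start, element.end))
    return sorted(spans)


def merge(spans):
    """Join overlapping or adjacent spans into a minimal playlist."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

A theme that references another theme therefore resolves to the union of all underlying first-level spans, and `merge` turns that union into segments that can be played without repetition.
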
[0025] Every frame of the work must appear in at least
one thematic element. This permits the viewer to select 
all themes, and view the entire work. 
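
This coverage requirement lends itself to a mechanical check. As a minimal sketch, assuming zero-based frame numbers and thematic elements already reduced to inclusive (start, end) frame spans, the hypothetical function below reports any frames that no theme accounts for.

```python
def uncovered_frames(total_frames, theme_spans):
    """Return the maximal (start, end) gaps not covered by any span.

    theme_spans is an iterable of inclusive (start, end) frame pairs drawn
    from all thematic elements; an empty result means full coverage.
    """
    gaps = []
    next_needed = 0  # lowest frame number not yet known to be covered
    for start, end in sorted(theme_spans):
        if start > next_needed:
            gaps.append((next_needed, min(start, total_frames) - 1))
        next_needed = max(next_needed, end + 1)
        if next_needed >= total_frames:
            break
    if next_needed < total_frames:
        gaps.append((next_needed, total_frames - 1))
    return gaps
```

An annotation tool could run such a check before the metadata is published and prompt the expert to assign the reported gaps to an existing or catch-all theme.
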
[0026] The second-level thematic annotations may be organized into a hierarchy. This hierarchy may be inferred from the relationships among the annotations themselves, or indicated directly by means of a number or labeling scheme. For example, annotations with type indicators within a certain range might represent parent elements to those annotations within another certain range, and so forth. Such a hierarchy of structure is created during the generation of the annotation data, and is used during the display of the metadata or the underlying work.

[0027] The metadata are stored in a structured file, 
which may itself be compressed by any of a number of
standard technologies to make storage and transmis- 
sion more efficient. 
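
As one concrete possibility among many, the structured file could be JSON text compressed with gzip. The disclosure does not name a file format or a compression scheme, so both choices, along with the function names and the sample record layout, are assumptions made purely for illustration.

```python
import gzip
import json


def pack_metadata(metadata):
    """Serialize annotation metadata to JSON and compress it with gzip."""
    text = json.dumps(metadata, sort_keys=True)
    return gzip.compress(text.encode("utf-8"))


def unpack_metadata(blob):
    """Reverse pack_metadata: decompress and parse back into Python objects."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Hypothetical metadata record used only to exercise the round trip.
SAMPLE = {
    "labels": {"en": ["Jimmy enters", "Jane enters"]},
    "annotations": [
        {"type": "person", "start": 12.0, "duration": 3.5, "label": 0},
    ],
}
```

The packed bytes can be stored or transmitted apart from the video itself, consistent with keeping the metadata separate from the work.
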

[0028] The time representation may be in fractional
seconds or by other means, rather than SMPTE frame 
times. 

[0029] FIGs. 3 and 4 illustrate the data structure within a sample frame such as frame B7. The frame B7 includes a header 28, a data portion 30, and a footer 32. The data portion 30 includes the video data used (in conjunction with data derived from previous decompressed frames) to display the frame and all the objects presented within it. The header 28 uniquely identifies the frame by including a timecode portion 34, which sets forth the absolute time of play within the video sequence and the frame number. The header 28 also includes an offset portion 36 that identifies in bytes the location of the closest previous I-frame B1 so that the base frame can be consulted by the decoder and the identified frame B7 subsequently accurately decompressed.
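
To make the header contents concrete, the sketch below packs a time code, a frame number and an I-frame offset into a fixed sixteen-byte record. The byte layout is entirely hypothetical: it is not the MPEG-2 bitstream syntax, and the field widths, byte order and the treatment of the offset as an absolute stream position are assumptions chosen for the example.

```python
import struct

# Hypothetical layout, big-endian, no padding:
#   4 x uint8  hh, mm, ss, ff   (timecode portion 34)
#   uint32     frame number     (timecode portion 34)
#   uint64     byte location of the closest previous I-frame (offset portion 36)
HEADER = struct.Struct(">4BIQ")


def pack_header(timecode, frame_number, iframe_offset):
    """Build the header bytes for one frame from its 'hh:mm:ss:ff' time code."""
    hh, mm, ss, ff = (int(part) for part in timecode.split(":"))
    return HEADER.pack(hh, mm, ss, ff, frame_number, iframe_offset)


def unpack_header(raw):
    """Read (timecode string, frame number, I-frame offset) from header bytes."""
    hh, mm, ss, ff, frame_number, iframe_offset = HEADER.unpack(raw[:HEADER.size])
    return f"{hh:02d}:{mm:02d}:{ss:02d}:{ff:02d}", frame_number, iframe_offset
```

A decoder reading such a header for frame B7 would obtain the location of B1 directly, without scanning backward through the stream.
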
[0030] The decoding procedure operates as shown in the flow diagram of FIG. 5. The user is presented with a choice of themes or events within the video sequence. As shown in FIG. 6, for instance, the user may select the desired portion of the video by first moving through a series of graphic user interface menu lists displayed on the video monitor on which the user is to view the video. A theme list is presented in menu display 40 comprised of, for instance, the themes of romance, conflict, and travel, each identified and selectable by navigating between labeled buttons 42a, 42b, and 42c, respectively. The selected theme will include a playlist, stored in memory, associated with that theme. Here, the 'romance' theme is selected by activating button 42a and playlist submenu 46 is displayed to the user. The playlist submenu 46 lists the video segment groupings associated with the theme selected in menu 40. Here, the playlist for romance includes the following permutations: 'man#1 with woman#1' at labeled button 48a, 'man#2 with woman#1' at labeled button 48b, and 'man#1 with woman#2' at button 48c. Further selection of a playlist, such as selection of playlist 48b, yields the presentation to the user of a segment list in segment submenu 50. The segment submenu 50 has listed thereon a plurality of segments 52a, 52b, and 52c appropriate to the theme and playlist.
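
The nested menus of FIG. 6 map naturally onto a nested dictionary. The sketch below is illustrative only: the themes and playlist names come from the example above, while the segment entries are placeholders keyed by reference numeral, the empty containers stand for content the example does not describe, and the `choices` helper is a hypothetical name.

```python
# Theme -> playlist -> segment list, mirroring menus 40, 46 and 50.
MENU = {
    "romance": {
        "man#1 with woman#1": [],  # segments not described in the example
        "man#2 with woman#1": ["52a", "52b: first date", "52c"],
        "man#1 with woman#2": [],
    },
    "conflict": {},
    "travel": {},
}


def choices(menu, *path):
    """List the entries shown after following the given selections."""
    node = menu
    for key in path:
        node = node[key]
    return list(node)
```

Calling `choices(MENU)` yields the theme list, `choices(MENU, "romance")` the playlist submenu, and `choices(MENU, "romance", "man#2 with woman#1")` the segment submenu, so each user selection simply extends the path by one key.
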

[0031] Creating the annotation list occurs in reverse, where the video technician creating the annotative metadata selects segments of the video sequence being annotated, each segment including a begin and end frame, and associates an annotation with that segment. Object annotations can be automatically derived, such as by a character recognition program or other known means, or manually input after thematic analysis of the underlying events and context of the video segment to the entire work. Annotations can be grouped in nested menu structures, such as shown in FIG. 6, to ease the selection and placement of annotated video segments within the playback tree structure.
[0032] The selected segment in FIG. 6, here segment 52b showing the first date between man#2 and woman#1 under the romance theme, begins at some start time and ends at some end time which are associated with a particular portion of the video sequence from a particular start frame to an end frame. In the flow diagram shown in FIG. 5, the start frame for the selected video segment is identified in block 60 by consulting the lookup table, and the base frame location is derived from it in block 62, as by reading the offset existing in the start frame. The decoder then starts decoding from the identified base frame in block 64 but only starts displaying the segment from the start frame in block 66. The display of the segment is ended in block 68 when the frame having the appropriate timecode 34 is decoded and displayed.

[0033] Referring back to FIG. 2, for instance, supposing a short (e.g. half second) segment is selected for view by the user, the system looks up the location of the frames associated with the segment within a table. In this case, the segment starts with frame B4 and ends with frame C6. The decoder reads the offset of frame B4 to identify the base I-frame B1 and begins decoding from that point. The display system, however, does not display any frame until B4 and stops at frame C6. Play of the segment is then complete and the user is prompted to select another segment for play by the user interface shown in FIG. 6.
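
The decode-then-display behavior of blocks 60 through 68 can be sketched as a short loop. This is a simplified illustration, not the disclosed decoder: it assumes fixed fifteen-frame GOPs and zero-based frame indices, and the `decode` and `display` callables are hypothetical stand-ins for a real MPEG-2 decoder and display system.

```python
GOP_SIZE = 15  # assumes fixed fifteen-frame GOPs, as in FIG. 2


def play_segment(start, end, decode, display, gop_size=GOP_SIZE):
    """Decode from the base frame of the start frame, display from start to end.

    start and end are zero-based frame indices, inclusive. decode(index)
    returns a picture; display(picture) shows it.
    """
    base = (start // gop_size) * gop_size  # blocks 60-62: locate the base frame
    for index in range(base, end + 1):     # block 64: decode from the base frame
        picture = decode(index)
        if index >= start:                 # block 66: display only from start
            display(picture)
    # block 68: the loop ends once the end frame has been displayed


# Example run for the segment B4..C6, i.e. indices 18..35 if GOP "A" starts at 0.
decoded, shown = [], []
play_segment(18, 35, lambda i: decoded.append(i) or i, shown.append)
```

In this run the decoder touches frames 15 through 35 (B1 through C6) while only frames 18 through 35 (B4 through C6) reach the display, matching the behavior described for FIG. 2.
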

[0034] These concepts can be extended to nonlinear 
time sequences, such as multimedia presentations, 
where at least some portion of the presentation consists 
of linear material. This applies also to audio streams, 
video previews, advertising segments, animation se- 
quences, stepwise transactions, or any process that re- 
quires a temporally sequential series of events that may 
be classified on a thematic basis. 
[0035] Having described and illustrated the principles of the invention in a preferred embodiment thereof, it should be apparent that the invention can be modified in arrangement and detail without departing from such principles. We claim all modifications and variations coming within the spirit and scope of the following claims.



Claims 

1. A method for generating annotations of viewable segments within a video sequence comprising the steps of:

selecting a start frame from a video sequence;
selecting an end frame from a video sequence to form in conjunction with the selected start frame a designated video segment;
associating an attribute with the designated video segment; and
storing the attribute as metadata within a lookup table for subsequent selection and presentation of the designated video segment to a viewer.

2. The method of claim 1, further including the step of automatically annotating scene division metadata within the lookup table.

3. The method of claim 1, further including the step of annotating a video segment responsive to an automated object recognition system.

4. The method of claim 3, wherein the objects automatically recognized by the system include a first-level attribute selected from the group consisting of scene boundaries, the presence of actors, the presence of specific objects, the occurrence of decipherable text in the video images, zoom or pan camera movements, or motion analysis.

5. The method of claim 1, further including the steps of:

selecting a second start frame from a video sequence;
selecting a second end frame from a video sequence to form in conjunction with the selected second start frame a second designated video segment, wherein said second designated video segment at least partially overlaps with said designated video segment;
associating a second attribute with the second designated video segment; and
storing the second attribute as metadata within the lookup table for subsequent selection and presentation of the second designated video segment to a viewer.

6. The method of claim 1 wherein said annotation includes a plurality of elements including a structural element and a thematic element.

7. The method of claim 1, wherein said metadata includes a low-level annotation comprising a type indicator, start time, duration or stop time, and a pointer to a label string.

8. The method of claim 7 wherein the type indicator refers to a one selected from the group consisting at least from a person, event, object, or text.

9. The method of claim 7 wherein the start and stop times are given in absolute terms.

10. The method of claim 7 wherein the start and stop times are given in relative terms to a reference point within the video sequence.

11. The method of claim 7, wherein said metadata includes a second-level annotation comprising a type indicator, a pointer to a label, and a pointer to a first of a linked list of elements.

12. The method of claim 1, further including the steps of:

presenting for visual inspection a list of the attributes contemporaneous with a timeline of the video sequence;
selecting at least one attribute from the list; and
performing the associating step responsive to the step of selecting at least one attribute from the list.